Localizing Memorization in SSL Vision Encoders
Recent work on studying memorization in self-supervised learning (SSL) suggests that even though SSL encoders are trained on millions of images, they still memorize individual data points. While effort has been put into characterizing the memorized data and linking encoder memorization to downstream utility, little is known about where the memorization happens inside SSL encoders. To close this gap, we propose two metrics for localizing memorization in SSL encoders on a per-layer (LayerMem) and per-unit basis (UnitMem). Our localization methods are independent of the downstream task, do not require any label information, and can be performed in a forward pass. By localizing memorization in various encoder architectures (convolutional and transformer-based) trained on diverse datasets with contrastive and non-contrastive SSL frameworks, we find that (1) while SSL memorization increases with layer depth, highly memorizing units are distributed across the entire encoder, (2) a significant fraction of units in SSL encoders experiences surprisingly high memorization of individual data points, which is in contrast to models trained under supervision, (3) atypical (or outlier) data points cause much higher layer and unit memorization than standard data points, and (4) in vision transformers, most memorization happens in the fully-connected layers. Finally, we show that localizing memorization in SSL has the potential to improve fine-tuning and to inform pruning strategies.
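The abstract does not define UnitMem, but its per-unit flavor can be sketched with a simple selectivity-style score: given each unit's mean activation on (augmentations of) each data point, measure how sharply the unit singles out one point relative to the rest. The function name `unit_selectivity` and the exact normalization below are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def unit_selectivity(acts):
    """Per-unit selectivity-style score over data points.

    acts: array of shape (n_points, n_units) holding the mean activation
    of each unit on (augmentations of) each data point.
    Returns one score per unit in [0, 1]: ~0 means the unit responds
    uniformly across data points; ~1 means it fires almost exclusively
    on a single data point.
    """
    acts = np.abs(acts)
    mu_max = acts.max(axis=0)                                 # strongest-point response
    mu_rest = (acts.sum(axis=0) - mu_max) / (acts.shape[0] - 1)  # mean over the others
    return (mu_max - mu_rest) / (mu_max + mu_rest + 1e-12)

# Toy check: make unit 0 respond almost only to data point 3.
rng = np.random.default_rng(0)
a = rng.random((100, 8))
a[3, 0] = 50.0
scores = unit_selectivity(a)   # scores[0] is close to 1, others are moderate
```

Because such a score needs only activations from a forward pass, it is label-free and downstream-task independent, matching the properties the abstract claims for its metrics.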
2D-OOB: Attributing Data Contribution Through Joint Valuation Framework
Data valuation has emerged as a powerful framework for quantifying each datum's contribution to the training of a machine learning model. However, it is crucial to recognize that the quality of cells within a single data point can vary greatly in practice. For example, even in the case of an abnormal data point, not all cells are necessarily noisy. The single scalar score assigned by existing data valuation methods blurs the distinction between noisy and clean cells of a data point, making it challenging to interpret the data values. In this paper, we propose 2D-OOB, an out-of-bag estimation framework for jointly determining helpful (or detrimental) samples as well as the particular cells that drive them. Our comprehensive experiments demonstrate that 2D-OOB achieves state-of-the-art performance across multiple use cases while being exponentially faster. Specifically, 2D-OOB shows promising results in detecting and rectifying fine-grained outliers at the cell level, and localizing backdoor triggers in data poisoning attacks.
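The joint sample-and-cell idea can be illustrated with a minimal out-of-bag sketch: train many weak learners on bootstrap samples and random feature subsets, then credit each out-of-bag (sample, used-feature) pair with the learner's correctness on that sample. The function `oob_cell_values`, the decision-tree learner, and the half-of-features subsampling are all assumptions for illustration, not 2D-OOB's actual estimator.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_cell_values(X, y, n_estimators=200, seed=0):
    """Sketch of a 2D (sample x cell) out-of-bag valuation.

    For each weak learner: draw a bootstrap sample and a random feature
    subset, fit, then score each out-of-bag sample. Every cell (i, j)
    where sample i was out-of-bag and feature j was used accrues that
    correctness. Higher averages suggest the cell helps models classify
    its row correctly. Illustrative only.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    score = np.zeros((n, d))
    count = np.zeros((n, d))
    for _ in range(n_estimators):
        boot = rng.integers(0, n, size=n)                    # bootstrap row indices
        oob = np.setdiff1d(np.arange(n), boot)               # rows not seen in training
        feats = rng.choice(d, size=max(1, d // 2), replace=False)
        if len(oob) == 0:
            continue
        tree = DecisionTreeClassifier(max_depth=3, random_state=0)
        tree.fit(X[boot][:, feats], y[boot])
        correct = (tree.predict(X[oob][:, feats]) == y[oob]).astype(float)
        for j in feats:
            score[oob, j] += correct
            count[oob, j] += 1
    return score / np.maximum(count, 1)
```

In this toy setup, cells in an informative column accumulate higher out-of-bag credit than cells in a noise column, which is the kind of cell-level distinction (e.g., for fine-grained outlier or backdoor-trigger localization) the abstract describes.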
8493eeaccb772c0878f99d60a0bd2bb3-AuthorFeedback.pdf
We thank all the reviewers for carefully checking the paper and acknowledging the "efficiency and practicality" of our method; we will also clarify this in the revised version. R1 asks for a discussion of the similarities and differences between our technical results and those of [19]. Training on the medoids is robust to noisy labels; indeed, Eq. 5 finds the best subset. R2 asks how clean the coreset is (i.e., the fraction of clean data points in the coreset).